Added a SAMTOOLS_PIPELINE to run multiple samtools commands at once by muffato · Pull Request #4571 · nf-core/modules

muffato · 2023-12-11T17:15:43Z

I've recently spent some time optimising the resource usage of some Nextflow pipelines and I found that the samtools workflow to mark duplicates takes 5x longer to run and 3x more disk space when implemented the nf-core way, i.e. as 4 modules put together in a sub-workflow, compared to having a single local module with a bash pipeline.

Last time we discussed this sort of module, #3310 and https://nfcore.slack.com/archives/CJRH30T6V/p1682062426042809 , it was decided against though we didn't know the actual improvements. But both @drpatelh and @SPPearce suggested to reopen the discussion after seeing the figures. After all, the nf-core guidelines don't forbid it (bold type mine).

Software that can be piped together SHOULD be added to separate module files unless there is a run-time, storage advantage in implementing in this way.

However, @drpatelh said that there should be options to disable some of the 4 commands, because not anyone may need them. Here is my take on this.

I introduce a new module that can run a bash pipeline (hence the name) of any commands the user wants in any order, as long as they take BAM/CRAM as inputs as outputs. It wasn't as trivial as I thought it would because samtools commands have different ways of defining the input and output files 🙃

It works like this:

workflow {
    input = [
        [ id:'test', single_end:false ], // meta map
        file(params.test_data['sarscov2']['illumina']['test_paired_end_bam'], checkIfExists: true)
    ]
    commands = ['collate', 'fixmate', 'sort', 'markdup']
    SAMTOOLS_PIPELINE ( input, [[],[]], commands )
    SAMTOOLS_PIPELINE.out.output
}
process {
    withName: SAMTOOLS_PIPELINE {
        ext.args = [
            fixmate: '-m',
        ]
    }
}

What I like about the approach:

No mulled container involved. Just the regular samtools, which is much easier to understand and update.
Easy to add more samtools commands to the module.
Maximum flexibility for the user in terms of chaining samtools commands.

What I dislike about the implementation:

I couldn't find how to dynamically get the value of task.ext.args${index}. So instead I make ext.args a map where the key is the command name and the value is the string to add to the command-line. I don't know if/how closures can be used. I would very much prefer to use the regular task.ext.args, task.ext.args2, etc.
Because of the above, my implementation doesn't handle the same command being used multiple times in the pipeline with different command-line options.
The configuration of the module is actually split between the .nf (commands = ['collate', 'fixmate', 'sort', 'markdup']) and the .config (ext.args = [ ... ]). Maybe I should move the definition of the commands to the .config ?
Indentation of the resulting .command.sh is very fragile. What I've got works but for instance I tried pipeline_command += "samtools ${this_command} ${all_args[index]} \\ \n" and that caused the cat <<-END_VERSIONS > versions.yml` to be indented, which broke the generation of the yml.

What it needs:

testing with CRAM. I'm no sure I put the --reference in the right place.
migration to nf-test.

What it may need:

The selected commands often have command-line options to take extra files as inputs or output more files. Those could be added to the module, though to be fair the existing modules often don't list every single possible option either.
Some samtools-based sub-workflows could be rewritten to use this module, bringing an instant performance boost to their workflows.
A different name ? I chose "pipeline" because it's a bash pipeline under the hood. @mashehu doesn't like it :) Alternatives: pipe, chain, any, multi, combine. Ideally it should be name that stands out from the already-crowded list of samtools commands
A simple sub-workflow wrapper so that people looking for samtools sub-workflows still find about this. (they may not think about checking modules)
Alternatively, only expose this capability as a sub-workflow, either by allowing "local" modules in nf-core sub-workflows and moving the file over there, or by embedding this process block in the sub-workflow's main.nf.

PR checklist

maxulysse · 2023-12-11T17:18:05Z

Love the idea

mashehu

Please don't call it pipeline, that is already too overloaded.
I am also not 100% sure about the universality of this. Feels like we are basically moving away from atomic modules again and one needs to dig through several configs to read through several configs and docs to figure out what the module/command is actually doing

muffato · 2023-12-11T17:47:10Z

Hi guys. You were quick 😉 ! I am still testing a few things locally. I was then going to fill in the body of the PR to explain how this came up and share the link on Slack for discussion. I know we already discussed "samtools sormadup" and decided against, so I don't expect an easy ride 😅 but I think there are good reasons to still do something on this topic.

@pg

@pg header

SPPearce · 2023-12-13T07:30:50Z

As someone whose cluster is currently running out of scratch space, I definitely support the idea of chaining multiple samtools operations into one module, producing lots of individual bam/cram files uses a lot of memory when not necessary.
We already pipe bwa directly into samtools for this exact reason, bam/cram files can be BIG.

mahesh-panchal · 2023-12-13T11:31:33Z

I definitely approve of the idea.

In terms of implementation, I would skip the for loop and only use if then else to make an array, and then join the string array with the pipe symbol when evaluating the string array in it's execution place. I.e.
L94 would look like:

${samtools_commands.join(' | ')}

I'm starting to think we need to revisit also how args are passed to modules. I would prefer that the args are indexed with the 'tool_subtool' key in this case for readability.

I agree with the overloading of word pipeline too. I like the word compose but I'm not sure how to integrate it as I think the purpose of the composition also needs to be clear in the name as there could be quite a few compositions of a single tool after a time.

muffato · 2023-12-13T12:12:22Z

In terms of implementation, I would skip the for loop and only use if then else to make an array, and then join the string array with the pipe symbol when evaluating the string array in it's execution place. I.e.

I like the idea of join. Will definitely implement it. But I still need a for loop of some sort to go over the command names, right ? If not an actual for, then perhaps a .collect with a closure ?

mahesh-panchal · 2023-12-16T15:04:25Z

I can't read properly on the phone ...
I'm not sure what to call these non-atomic modules. I think the name should reflect the composition, but perhaps following the subworkflow naming scheme works best here.

SPPearce · 2023-12-17T16:36:40Z

How does something very explicit, like samtools multicommand sound?
I don't think we want a lot of these type of modules, only very common ones like samtools (bedtools perhaps).

mahesh-panchal · 2023-12-18T08:03:20Z

I think it would be better if we separated these aggregate commands in their own folder, like subworkflows are. It makes them easier to find, and highlights they're not atomic. Perhaps call the folder aggregates. I like the samtools_multicommand. I was thinking samtools_composepipe, or samtools_streamcommands, but I think they sound odd.

muffato · 2023-12-18T11:05:51Z

I think there's too much emphasis on the (non-)atomicity of this. The nf-core guidelines don't prohibit multiple commands running in a module. There is already a lot of modules that run multiple commands (the best proxy is to search for mulled containers: they typically indicate that the module needs another tool from another package).

In my mind, I map "modules" to Nextflow process and "sub-workflow" to Nextflow workflow, and I'm not keen on introducing another term for non-atomic modules, which are still process. Or I would argue to move all the existing non-atomic modules over there ;)

As an alternative, what about naming this: modules/nf-core/custom/multisamtools ? Still a "module" (because a process) but under the custom group because it's obviously customised.

mahesh-panchal · 2023-12-18T15:04:56Z

While the guidelines don't prohibit multiple commands the reason is because the modules would be impractical without them. The modules are supposed to be as simple as possible without making them impractical. There may also be examples where this isn't the case, but there several modules that don't fully follow guidelines.

For findability, I do think this needs to be separated in some way. This kind of process definitely follows what the purpose of subworkflows are supposed to be.

As it stands, there are also several modules that are only used in a certain sequence and would benefit from being aggregated into a single process. This is part of a larger issue to me.

SPPearce · 2023-12-18T16:12:21Z

I think this could slot into the subworkflows folder, rather than into a separate folder. I don't expect to be making too many of these chained modules like this.

muffato · 2023-12-19T13:27:45Z

Well, I did mention sub-workflows in my OP

A simple sub-workflow wrapper so that people looking for samtools sub-workflows still find about this. (they may not think about checking modules)

Alternatively, only expose this capability as a sub-workflow, either by allowing "local" modules in nf-core sub-workflows and moving the file over there, or by embedding this process block in the sub-workflow's main.nf.

The first technical question is: can nf-core sub-workflows have "local" modules ? I can't really see room for that in the repository

mahesh-panchal · 2023-12-19T15:34:55Z

@nf-core/core we need a decision where to place this, because this is a necessity.

adamrtalbot · 2023-12-19T16:27:39Z

I'm all in favour of more piping, I think we need to use it much much more to speed our pipelines up.

This is a pretty cool solution, but fairly complex? I'm not sure it will be very obvious to a new user how to give it a go. A number of modules that are defined by what they do is a simpler approach, fits more naturally with nf-core/modules and is more approachable to new users. e.g., let's imagine SAMTOOLS_BAMTOFASTQ process:

samtools collate -u -O $bam \
    | samtools fastq -1 ${prefix}.1.fastq.gz -2 ${prefix}.2.fastq.gz -0 /dev/null -s /dev/null -n

Clear, readable, quick to understand. In a pipeline it's obvious what it's doing:

SAMTOOLS_BAMTOFASTQ(bam)

This is the current solution, which is fine but not obvious to a new developer?

include { SAMTOOLS_PIPELINE as SAMTOOLS_BAMTOFASTQ }

SAMTOOLS_BAMTOFASTQ(bam, [], ['collate', 'fastq'])

muffato · 2023-12-19T20:11:50Z

@adamrtalbot . Then maybe #3310 could be reopened because it's exactly what you're advocating for ?

This PR is for a follow-up module that allows enabling/disabling the components of the command, without having to write a new module for every combination.

FYI my PR is currently only for BAM→BAM transformations. samtools fastq is not yet supported.

mahesh-panchal · 2023-12-19T20:28:03Z

While this would be complex to a newcomer, they still have the option stick to the more traditional path.

I'm quite happy for processes like these to require their own documentation because the resource benefits are significant. In my opinion, there is no way to have best practice pipelines and also have everything newcomer friendly. We already do things that are not newcomer friendly like the way some parameters are separated into input: while the rest come in through another route. I know why it's there, but it still trips up every person I've taught.

These processes could also help serve to improve developer skills by demonstrating practical examples showing groovy tips and tricks, and better resource usage.

For pipeline readability, the developer can alias the name to something more inline what they intend to use the generalised process for. The rest of the complexity won't matter as long as it brings the benefits it was intended for.

adamrtalbot · 2023-12-20T11:49:03Z

@adamrtalbot . Then maybe #3310 could be reopened because it's exactly what you're advocating for ?

I think so. We've got a bit dogmatic about "1 tool, 1 module", to the point of losing sight of why that rule exists. In my mind, a module should be an single unit of work, not a piece of software. It's often the case that these two are the same but not always and sometimes nf-core pipelines are absurdly inefficient because everything runs in it's own module.

We already do things that are not newcomer friendly

Then let's make things easier to use and more friendly. Not more complicated because reasons.

These processes could also help serve to improve developer skills by demonstrating practical examples showing groovy tips and tricks, and better resource usage.

Nah, they'll just make a local module in their pipeline.

To be fair, I quite like this module, think it's really good code and love the flexibility. I'd happy accept this under the golden rule: Minimum Useful Thing™️, but I'm firmly in the camp we should be simplifying and reducing nf-core, not adding complexity.

mahesh-panchal · 2023-12-20T15:38:42Z

We've got a bit dogmatic about "1 tool, 1 module", to the point of losing sight of why that rule exists. In my mind, a module should be an single unit of work, not a piece of software.

This would be a positive change, but is everyone in agreement (to generally allow multiple commands)?

These processes could also help serve to improve developer skills by demonstrating practical examples showing groovy tips and tricks, and better resource usage.

Nah, they'll just make a local module in their pipeline.

Then I would argue we're not annotating modules well enough and not making them findable. I think to help others, it would be useful if we added more comments to modules explaining anything that might be complicated.

To be fair, I quite like this module, think it's really good code and love the flexibility. I'd happy accept this under the golden rule: Minimum Useful Thing™️, but I'm firmly in the camp we should be simplifying and reducing nf-core, not adding complexity.

I think we're in agreement here to make things as simple as possible. I don't think anyone wants to add complexity unless it's a necessity.

muffato · 2024-01-10T21:38:55Z

FYI, there is already a module that runs samtools commands with a pipe: https://github.com/nf-core/modules/blob/master/modules/nf-core/samtools/collatefastq/main.nf
It does samtools collate | samtools fastq, which could very well be achieved with the two existing modules SAMTOOLS_COLLATE and SAMTOOLS_FASTQ, but is of course more efficient as a pipe in a single process

mahesh-panchal · 2024-01-11T09:04:12Z

In my opinion that's one of the modules that doesn't follow guidelines and shouldn't be in the modules folder ( or more specifically should not have been approved - the placement is still up for discussion). However, generally, this too should have a place along side the process in this PR and the other optimized processes suggested.

adamrtalbot · 2024-01-11T10:18:41Z

samtools fastq/fasta must take an input from samtools collate, otherwise it generate illogical but valid FASTQ file, i.e. a silent error, see the examples. Including a samtools fastq/fasta module is pretty dangerous because someone naive might install it without realising.

We shouldn't follow the guidelines if they cause people to create broken pipelines.

mahesh-panchal · 2024-01-11T12:00:47Z

That's only for read pairs, that the input needs to be collated.

mahesh-panchal · 2024-01-11T12:05:37Z

But I also hope none of our guidelines cause people to make broken pipelines. At the end of the day, it is up to the developer to check that their code is correct, hopefully through rigorous testing and peer-review.

matthdsm · 2024-02-29T08:36:32Z

Can anybody add what was decided in the last maintainers meeting? I couldn't make it, but I'm very interested in either this or my sormadup proposal.

mahesh-panchal · 2024-02-29T08:45:32Z

See nf-core/website#2327 for a basic summary and start on guidelines. @maxulysse is in charge of this. Basically this PR might not be maintainable by the average nf-core user (most maintainers were not in favor of it), and maintainers would like the script to be explicit with the tool, in particular so argsN maps to a particular tool/subcommand (i.e. like yours). However, they're also worried about maintenance burden, and since one can make local modules, these compound processes should likely go there unless they can be used across multiple pipelines. There's still space to provide more opinions though.

matthdsm · 2024-02-29T08:48:46Z

Thanks @mahesh-panchal!
@maxulysse, In light of those decisions, should I put more work towards the sormadup module? I'm sure it can be used elsewhere, seeing there's an entire subworkflows that does the same, albeit slower and more resource intensive.
I think @priyanka-surana would be interesed too for the the sanger-tol pipelines.

priyanka-surana · 2024-02-29T09:19:47Z

@matthdsm We (@muffato) are already using your implementation as a local module. Would be great to have this directly from nf-core.

muffato · 2024-02-29T09:23:01Z

Yes @matthdsm . In sanger-tol we've got sormadup already (almost exactly your version, the only change I did was removing the sort_memory), but also soon collate | fasta and view | fastq

maxulysse · 2024-02-29T12:57:51Z

I'm in favor of resurrecting #3310, that sounds like a good candidate for a meta module that will be wonders in term of data footprint savings

SPPearce · 2024-02-29T13:27:45Z

I was a dissenting voice on this, I like the flexibility of this "module" and would merge this in.
But yes, modules chaining together multiple tools are acceptable, it was just the complexity of this one that is not currently in favour.

Added a SAMTOOLS_PIPELINE to run multiple samtools commands at once

2ad8e73

muffato added new module Adding a new module WIP Work in progress labels Dec 11, 2023

muffato self-assigned this Dec 11, 2023

mashehu reviewed Dec 11, 2023

View reviewed changes

muffato added 14 commits December 11, 2023 18:05

Just call it sormadup

9c3ec9d

Better text

a2f9f58

Added a stub

12085a4

Introduced helper variables

56f81e8

Directly modify the main variable

c8332a5

More commands (and syntaxes)

a85140a

Expanded support

41d9fba

Made .command.sh more legible

d409456

Documentation update

f33f441

bugfix: reheader cannot use multiple CPUs

6d7722d

Updated the tests and added one that uses all possible commands

c102c93

Can't use TMPDIR as it's set to a non-writable location in Singularity

915e5af

Need to update the checksum to include the new -T option that is in the

4f10688

@pg header

Documentation update

a8859c7

muffato marked this pull request as ready for review December 13, 2023 08:32

muffato requested review from a team and vagkaratzas December 13, 2023 08:32

muffato added Ready for Review and removed WIP Work in progress labels Dec 13, 2023

nvnieuwk reviewed Dec 13, 2023

View reviewed changes

Comment thread modules/nf-core/samtools/pipeline/main.nf Outdated

mahesh-panchal reviewed Dec 16, 2023

View reviewed changes

Comment thread modules/nf-core/samtools/pipeline/main.nf

mahesh-panchal mentioned this pull request Feb 23, 2024

Add additional information for multi-tool modules nf-core/website#2327

Merged

matthdsm mentioned this pull request Feb 29, 2024

New module samtools/sormadup #5012

Merged

17 tasks

muffato closed this Mar 2, 2024

muffato deleted the samtools_pipeline branch March 2, 2024 09:43

Conversation

muffato commented Dec 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR checklist

Uh oh!

maxulysse commented Dec 11, 2023

Uh oh!

mashehu left a comment

Choose a reason for hiding this comment

Uh oh!

muffato commented Dec 11, 2023

Uh oh!

SPPearce commented Dec 13, 2023

Uh oh!

Uh oh!

mahesh-panchal commented Dec 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

muffato commented Dec 13, 2023

Uh oh!

mahesh-panchal commented Dec 16, 2023

Uh oh!

Uh oh!

SPPearce commented Dec 17, 2023

Uh oh!

mahesh-panchal commented Dec 18, 2023

Uh oh!

muffato commented Dec 18, 2023

Uh oh!

mahesh-panchal commented Dec 18, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SPPearce commented Dec 18, 2023

Uh oh!

muffato commented Dec 19, 2023

Uh oh!

mahesh-panchal commented Dec 19, 2023

Uh oh!

adamrtalbot commented Dec 19, 2023

Uh oh!

muffato commented Dec 19, 2023

Uh oh!

mahesh-panchal commented Dec 19, 2023

Uh oh!

adamrtalbot commented Dec 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mahesh-panchal commented Dec 20, 2023

Uh oh!

muffato commented Jan 10, 2024

Uh oh!

mahesh-panchal commented Jan 11, 2024

Uh oh!

adamrtalbot commented Jan 11, 2024

Uh oh!

mahesh-panchal commented Jan 11, 2024

Uh oh!

mahesh-panchal commented Jan 11, 2024

Uh oh!

matthdsm commented Feb 29, 2024

Uh oh!

mahesh-panchal commented Feb 29, 2024

Uh oh!

matthdsm commented Feb 29, 2024

Uh oh!

priyanka-surana commented Feb 29, 2024

Uh oh!

muffato commented Feb 29, 2024

Uh oh!

maxulysse commented Feb 29, 2024

Uh oh!

SPPearce commented Feb 29, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

muffato commented Dec 11, 2023 •

edited

Loading

mahesh-panchal commented Dec 13, 2023 •

edited

Loading

mahesh-panchal commented Dec 18, 2023 •

edited

Loading

adamrtalbot commented Dec 20, 2023 •

edited

Loading